Improving protein function prediction by learning and integrating representations of protein sequences and function labels

Abstract

Motivation: As fewer than 1% of proteins have experimentally determined function information, computationally predicting the function of proteins is critical for obtaining functional information for most proteins and has been a major challenge in protein bioinformatics. Despite the significant progress made by the community in the last decade, the overall accuracy of protein function prediction remains low, particularly for rare function terms associated with few proteins in protein function annotation databases such as UniProt.

Results: We introduce TransFew, a new transformer model that learns representations of both protein sequences and function labels (Gene Ontology (GO) terms) to predict the function of proteins. TransFew leverages a large pre-trained protein language model (ESM2-t48) to learn function-relevant representations of proteins from raw protein sequences, and uses a biomedical natural language model (BioBERT) together with a graph convolutional neural network-based autoencoder to generate semantic representations of GO terms from their textual definitions and hierarchical relationships. The two representations are combined via cross-attention to predict protein function. Integrating the protein sequence and label representations not only enhances overall function prediction accuracy but also delivers robust performance on rare function terms with limited annotations by facilitating annotation transfer between GO terms.

Availability and implementation: https://github.com/BioinfoMachineLearning/TransFew.
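To make the fusion step concrete, below is a minimal PyTorch sketch of cross-attention between label embeddings and a protein's sequence representation. The class name, dimensions, and single-layer design are illustrative assumptions, not TransFew's exact architecture.

```python
import torch
import torch.nn as nn

class CrossAttentionHead(nn.Module):
    """Minimal sketch: GO-term label embeddings attend over a protein's
    sequence representation, yielding one logit per GO term. Dimensions
    and the single attention layer are illustrative assumptions."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, 1)

    def forward(self, label_emb, seq_repr):
        # label_emb: (num_go_terms, d_model) -- from the label encoder
        # seq_repr:  (seq_len, d_model)      -- from the protein encoder (e.g. ESM2)
        q = label_emb.unsqueeze(0)          # (1, num_go_terms, d_model)
        kv = seq_repr.unsqueeze(0)          # (1, seq_len, d_model)
        fused, _ = self.attn(q, kv, kv)     # each GO term attends to the sequence
        return self.classifier(fused).squeeze(-1).squeeze(0)  # (num_go_terms,)
```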


Supplementary Note 5: Evaluation Metrics
In this work, we use four CAFA [4,20] evaluation metrics: F_max, S_min, weighted F_max, and the area under the precision-recall curve (AUPR) to evaluate protein function predictions. They are defined as follows.
• Precision and recall:
$$pr_i(\tau) = \frac{\sum_{f} I(f \in P_i(\tau) \land f \in T_i)}{\sum_{f} I(f \in P_i(\tau))}, \qquad rc_i(\tau) = \frac{\sum_{f} I(f \in P_i(\tau) \land f \in T_i)}{\sum_{f} I(f \in T_i)}$$
where $f$ is a term, $P_i(\tau)$ is the set of predictions, $T_i$ denotes the corresponding ground truth, $i$ indexes the protein sequence under consideration, and $\tau$ is the decision threshold. $m(\tau)$ is the number of protein sequences with at least one predicted score greater than or equal to $\tau$, $I(\cdot)$ is an indicator function, and $n_e$ is the number of proteins in the test set of a particular test study.
• $F_{max}$:
$$F_{max} = \max_{\tau} \left\{ \frac{2 \cdot \overline{pr}(\tau) \cdot \overline{rc}(\tau)}{\overline{pr}(\tau) + \overline{rc}(\tau)} \right\}, \quad \text{where} \quad \overline{pr}(\tau) = \frac{1}{m(\tau)} \sum_{i=1}^{m(\tau)} pr_i(\tau), \quad \overline{rc}(\tau) = \frac{1}{n_e} \sum_{i=1}^{n_e} rc_i(\tau)$$
• Information content (ic) of term $f$:
$$ic(f) = -\log \Pr(f \mid P(f))$$
Here, $\Pr(f \mid P(f))$ represents the probability that term $f$ in the ontology is associated with a protein given that all of its parents are associated.
• Weighted precision and weighted recall:
$$wpr_i(\tau) = \frac{\sum_{f} ic(f) \cdot I(f \in P_i(\tau) \land f \in T_i)}{\sum_{f} ic(f) \cdot I(f \in P_i(\tau))}, \qquad wrc_i(\tau) = \frac{\sum_{f} ic(f) \cdot I(f \in P_i(\tau) \land f \in T_i)}{\sum_{f} ic(f) \cdot I(f \in T_i)}$$
Weighted $F_{max}$ is obtained from the averaged weighted precision and recall in the same way as $F_{max}$.
• Remaining uncertainty and misinformation:
$$ru(\tau) = \frac{1}{n_e} \sum_{i=1}^{n_e} \sum_{f} ic(f) \cdot I(f \notin P_i(\tau) \land f \in T_i), \qquad mi(\tau) = \frac{1}{n_e} \sum_{i=1}^{n_e} \sum_{f} ic(f) \cdot I(f \in P_i(\tau) \land f \notin T_i)$$
• $S_{min}$:
$$S_{min} = \min_{\tau} \sqrt{ru(\tau)^2 + mi(\tau)^2}$$
• Area under the precision-recall curve (AUPR):
$$AUPR = \int_0^1 Precision(R)\, dR$$
where $Precision(R)$ represents the precision at a given recall level $R$.
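For concreteness, the following NumPy sketch shows how F_max, S_min, and AUPR can be computed from dense score and label matrices. The function name, matrix layout, and threshold grid are illustrative assumptions; the actual evaluation in this work uses the CAFA-evaluator.

```python
import numpy as np

def cafa_metrics(scores, labels, ic, thresholds=np.linspace(0.01, 1.0, 100)):
    """Sketch of F_max, S_min, and AUPR over dense matrices.
    scores: (n_proteins, n_terms) predicted scores in [0, 1]
    labels: (n_proteins, n_terms) binary ground-truth annotations
    ic:     (n_terms,) information content of each GO term
    """
    prs, rcs, rus, mis = [], [], [], []
    for t in thresholds:
        pred = scores >= t                            # P_i(tau)
        tp = (pred & (labels == 1)).sum(axis=1)
        has_pred = pred.sum(axis=1) > 0               # proteins counted in m(tau)
        # per-protein precision, averaged over proteins with >= 1 prediction
        pr_i = tp[has_pred] / pred.sum(axis=1)[has_pred]
        rc_i = tp / np.maximum(labels.sum(axis=1), 1)
        prs.append(pr_i.mean() if has_pred.any() else 0.0)
        rcs.append(rc_i.mean())
        # information-weighted false negatives / false positives for S_min
        fn = (~pred) & (labels == 1)
        fp = pred & (labels == 0)
        rus.append((fn * ic).sum(axis=1).mean())
        mis.append((fp * ic).sum(axis=1).mean())
    prs, rcs = np.array(prs), np.array(rcs)
    f = 2 * prs * rcs / np.maximum(prs + rcs, 1e-12)
    f_max = f.max()
    s_min = np.sqrt(np.array(rus) ** 2 + np.array(mis) ** 2).min()
    # AUPR via the trapezoidal rule, without extrapolating the curve ends
    order = np.argsort(rcs)
    aupr = np.trapz(prs[order], rcs[order])
    return f_max, s_min, aupr
```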
Evaluation was performed using the CAFA-evaluator [11], with the best score reported for each metric (F_max, weighted F_max, and S_min). Additionally, we compute the AUPR for each method using the trapezoidal rule on the precision and recall values provided by the CAFA-evaluator, without interpolating the precision-recall curve at the extremes. For information accretion, the InformationAccretion repository [10] was utilized.

Supplementary Note 6: Baseline Methods
We compared TransFew with six baseline methods, namely Naive, DiamondBLAST, Tale, NetGO 3.0, DeepGO-SE, and SPROF-GO. A concise overview of each method follows.
Naive: Simply uses the frequency of Gene Ontology (GO) terms in the training dataset to make predictions.
DiamondBLAST: Based on sequence similarity scores obtained through BLAST, it identifies similar sequences in the training set and transfers annotations from the most similar ones [2,7].
Tale: A transformer-based method that integrates protein sequence and label features to predict protein function by jointly embedding sequence and hierarchical label information [3]. Predictions for Tale were generated by downloading the code from GitHub and running it locally.
NetGO 3.0: An ensemble method combining the outputs of seven individual function prediction methods that use various input sources: Naive prediction, BLAST-KNN, LR-3mer, LR-InterPro, Net-KNN, LR-Text, and LR-ESM [8,13,15,17,18]. Test predictions were obtained through the NetGO3 web server.
SPROF-GO: An alignment-free method employing a pre-trained protein language model to extract informative sequence embeddings. It utilizes self-attention pooling to focus on crucial residues and integrates homology information using a label diffusion algorithm [19]. Test predictions were obtained through the provided SPROF-GO web server.
DeepGO-SE: Utilizes a pre-trained large protein language model combined with GO background knowledge and protein-protein interactions (PPIs) to predict protein function [6]. Predictions for DeepGO-SE were generated by cloning and running the tool locally.
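As an illustration of the simplest baseline above, here is a minimal sketch of a frequency-based Naive predictor. The helper name and data layout are hypothetical, not code from any of the cited tools.

```python
import numpy as np

def naive_baseline(train_annotations, n_terms):
    """Sketch of the Naive CAFA baseline: score each GO term for every
    test protein by the term's relative frequency in the training
    annotations (a list of per-protein sets of GO-term indices)."""
    counts = np.zeros(n_terms)
    for terms in train_annotations:
        counts[list(terms)] += 1
    freq = counts / len(train_annotations)     # score for term f = freq(f)

    def predict(n_test_proteins):
        # every test protein receives the same frequency-based scores
        return np.tile(freq, (n_test_proteins, 1))
    return predict

# usage: predict = naive_baseline(train_sets, n_terms); scores = predict(100)
```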

Supplementary Note 7: Experiment on Label Embedding
To test whether the label encoder improved prediction accuracy, we replaced its embeddings with a random matrix for each GO partition group and measured how the performance changed. Specifically, for Partition x, where x = 1, 2 for MF and CC and x = 1, 2, 3 for BP, the embeddings of all GO terms in the partition/group were substituted with a random matrix. For instance, for Partition 1 in CC, we replaced the label-encoder embeddings of all 873 GO terms with a random matrix.
The results after replacing the GO term embeddings with random matrices, compared with those of TransFew employing the label encoder, are shown in Table S3 below. They show that using the label encoder to generate the representations of GO terms improves prediction accuracy across the board.
Table S3: Performance on the Test all dataset of TransFew implementations in which the label-encoder embeddings of the GO terms in one partition are replaced with a random matrix, compared with the final TransFew that uses the label encoder.
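A minimal sketch of this ablation, assuming the label embeddings are stored as a single tensor indexed by GO term (function and variable names are illustrative):

```python
import torch

def ablate_partition(label_embeddings, partition_idx, seed=0):
    """Replace the label-encoder embeddings of the GO terms in one
    partition with a random matrix of the same shape, leaving all other
    term embeddings untouched (sketch of the Table S3 ablation)."""
    g = torch.Generator().manual_seed(seed)
    ablated = label_embeddings.clone()
    rand = torch.randn(len(partition_idx), label_embeddings.shape[1], generator=g)
    ablated[partition_idx] = rand
    return ablated

# e.g. replace the embeddings of the 873 Partition-1 CC terms:
# emb_ablated = ablate_partition(emb, cc_partition1_indices)
```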

Figure S1: The training and validation curves for TransFew (red) and TransFew + InterPro + MSA (called Combined). The sub-figures on the left are for GO terms with annotation frequency greater than or equal to 30, and the sub-figures on the right are for rare GO terms with annotation frequency less than 30. Throughout the training process (solid lines), TransFew + InterPro + MSA consistently fits the training data better than TransFew, but on the validation data (dashed lines) TransFew performs better.

Figure S2: The average precision (AP) and area under the ROC curve (AUC) of three graph neural network-based autoencoders for the three GO categories: molecular function (top), cellular component (middle), and biological process (bottom). The three encoding architectures have similar performance.
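For reference, below is a minimal sketch of a graph autoencoder of the kind compared in Figure S2: a two-layer GCN encoder over the GO hierarchy with an inner-product edge decoder. Layer sizes and the specific decoder are illustrative assumptions, not the exact architectures evaluated.

```python
import torch
import torch.nn as nn

class GCNAutoencoder(nn.Module):
    """Sketch of a GCN-based autoencoder over the GO hierarchy.
    adj_norm: normalized adjacency matrix (n_terms, n_terms) of the GO DAG.
    x:        initial term features, e.g. BioBERT embeddings of definitions.
    The encoder yields term embeddings; the inner-product decoder
    reconstructs edges, the usual graph-autoencoder training signal."""
    def __init__(self, in_dim, hid_dim=512, out_dim=256):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hid_dim)
        self.w2 = nn.Linear(hid_dim, out_dim)

    def encode(self, x, adj_norm):
        h = torch.relu(adj_norm @ self.w1(x))   # first graph convolution
        return adj_norm @ self.w2(h)            # second graph convolution

    def forward(self, x, adj_norm):
        z = self.encode(x, adj_norm)
        # predicted edge probabilities between every pair of GO terms
        return torch.sigmoid(z @ z.t()), z
```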

Table S2: The number of proteins in the training, validation, Test all, and Test novel datasets for each of the three GO categories (BP, MF, and CC), along with the number of proteins in TrEMBL. Sequences in Test all with over 30% identity to the training data were filtered out using MMseqs, resulting in the Test novel dataset.
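A sketch of this filtering step using MMseqs2's easy-search workflow; the file names, wrapper function, and exact invocation are illustrative assumptions rather than the pipeline used in this work.

```python
import subprocess

def find_redundant_test_ids(test_fasta, train_fasta, out_tsv="hits.tsv"):
    """Sketch: search test sequences against the training set with MMseqs2,
    keeping only hits at >= 30% sequence identity, then collect the matched
    query IDs. Removing these IDs from Test all yields Test novel.
    Assumes the `mmseqs` binary is on PATH."""
    subprocess.run(
        ["mmseqs", "easy-search", test_fasta, train_fasta, out_tsv, "tmp",
         "--min-seq-id", "0.3"],
        check=True,
    )
    with open(out_tsv) as fh:
        matched = {line.split("\t")[0] for line in fh}   # query IDs with hits
    return matched
```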